Allow PTransforms to be applied directly to dataframes. #25919

robertwb · 2023-03-22T00:40:04Z

This'll be especially handy for applying RunInference.

robertwb · 2023-03-22T00:40:50Z

R: @damccorm

github-actions · 2023-03-22T00:41:54Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

sdks/python/gen_protos.py

damccorm · 2023-03-22T14:56:37Z

sdks/python/apache_beam/dataframe/frames.py

+
+  def _maybe_elementwise_or(self, right):
+    if isinstance(right, PTransform):
+      return convert.to_dataframe(convert.to_pcollection(self) | right)


I think this is fine, but it is worth calling out that we're opening users up to doing an inefficient thing, where they write:

    df = convert.to_dataframe(pc)
    df2 = df | MyDfOperation()
    df3 = df2 | MySchemaTransform() | MySchemaTransform2() | MySchemaTransform3()
    result = convert.to_dataframe(df3)

which would be (much?) more efficient written as:

    df = convert.to_dataframe(pc)
    df2 = df | MyDfOperation()
    result = convert.to_dataframe(df2) | MySchemaTransform() | MySchemaTransform2() | MySchemaTransform3()

because this avoids the repeated to_dataframe/to_pcollection transitions. The former is probably more natural though, especially if you have real df operations mixed in there, even if it's less efficient. I think the user experience still trumps the efficiency loss, but it might be something we want to document.

Added a note. Once we push the batching stuff through, there could be little to no overhead here, but that's future work.
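
To make the efficiency point above concrete, here is a minimal, self-contained toy model of the dispatch in `_maybe_elementwise_or`. It does not use Beam: `PTransform`, `PCollection`, `Frame`, `to_pcollection`, `to_dataframe`, and `run_and_count` are all hypothetical stand-ins written for this sketch, and the conversion counter is only an assumption-laden proxy for the real overhead.

```python
# Toy model only: none of these classes are Beam's. They mimic the shape of
# the dispatch in the diff above so that conversions can be counted.

class PTransform:
    """Stand-in transform: wraps a function from a list of rows to a list of rows."""
    def __init__(self, fn):
        self.fn = fn


class PCollection:
    """Stand-in collection: applying a transform never converts anything."""
    def __init__(self, rows):
        self.rows = rows

    def __or__(self, transform):
        return PCollection(transform.fn(self.rows))


# Hypothetical counter of frame <-> collection round trips.
conversions = {"to_pcollection": 0, "to_dataframe": 0}


def to_pcollection(frame):
    conversions["to_pcollection"] += 1
    return PCollection(frame.rows)


def to_dataframe(pcoll):
    conversions["to_dataframe"] += 1
    return Frame(pcoll.rows)


class Frame:
    """Stand-in deferred dataframe."""
    def __init__(self, rows):
        self.rows = rows

    def __or__(self, right):
        # Same shape as the diff: a PTransform on the right-hand side
        # triggers one conversion in each direction.
        if isinstance(right, PTransform):
            return to_dataframe(to_pcollection(self) | right)
        return NotImplemented


add_one = PTransform(lambda rows: [r + 1 for r in rows])
double = PTransform(lambda rows: [r * 2 for r in rows])


def chained_on_frame(frame):
    # Every `|` goes through Frame.__or__, so every step converts twice.
    return frame | add_one | double | add_one


def chained_on_pcollection(frame):
    # Convert once, chain on the collection, convert back once.
    return to_dataframe(to_pcollection(frame) | add_one | double | add_one)


def run_and_count(pipeline_fn, rows):
    """Reset the counter, run the pipeline, return (rows, total conversions)."""
    for key in conversions:
        conversions[key] = 0
    result = pipeline_fn(Frame(list(rows)))
    return result.rows, sum(conversions.values())
```

In this toy model both pipelines produce the same rows, but `run_and_count(chained_on_frame, ...)` reports one conversion pair per transform, while `run_and_count(chained_on_pcollection, ...)` reports a single pair regardless of chain length. How much each conversion costs in a real pipeline is not something this sketch can say.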

codecov · 2023-03-22T22:36:20Z

Codecov Report

Merging #25919 (e761360) into master (fcf51fb) will decrease coverage by 0.01%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##           master   #25919      +/-   ##
==========================================
- Coverage   71.41%   71.41%   -0.01%     
==========================================
  Files         778      778              
  Lines      102420   102441      +21     
==========================================
+ Hits        73146    73156      +10     
- Misses      27818    27829      +11     
  Partials     1456     1456

Flag	Coverage Δ
python	`79.95% <90.90%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
sdks/python/apache_beam/dataframe/frames.py	`95.31% <90.90%> (-0.03%)`	⬇️

... and 12 files with indirect coverage changes

robertwb · 2023-03-23T17:46:46Z

Just checking whether "I think this is fine" is an LGTM.

damccorm

> Just checking whether "I think this is fine" is an LGTM.

LGTM. I did want to block on the proto changes, which is why I didn't give the explicit LGTM; sorry for the ambiguity.

robertwb · 2023-03-23T18:39:10Z

No problem. And, yes, the proto changes were merged separately.

github-actions bot added the python label Mar 22, 2023

robertwb force-pushed the df-apply branch from c772bd8 to e74a5cd Compare March 22, 2023 00:45

damccorm reviewed Mar 22, 2023

robertwb added 2 commits March 22, 2023 15:08

Allow PTransforms to be applied directly to dataframes.

fee461d

Add note about efficiency.

029451b

robertwb force-pushed the df-apply branch from e74a5cd to 029451b Compare March 22, 2023 22:13

mypy

e761360

damccorm approved these changes Mar 23, 2023

robertwb merged commit 4484c19 into apache:master Mar 23, 2023
